This is a final report of a project whose goal was to use prosodic information
to aid speech recognition systems. Naturalistic speech data was collected
and used to test and develop hypotheses about the relationship of prosodic
information to the syntactic and semantic structure of English sentences.
Members of the project interacted closely with other members of the ARPA SUR
project. A series of SUR notes describing acoustic features of prosodies,
phonological rules, stress rules, etc. was produced.

Introduction


Human  perception  of  speech  involves  the  listener's  knowledge  of  his 
language  and  of  the  world.  Information  from  phonological  and  lexical 
structure,  syntax  and  semantics,  as  well  as  the  listener's  expectations 
about  the  speaker's  behavior,  can  all  affect  the  processing  of  the 
acoustical  signal.  In  contrast,  machine  perception  concentrates  on 
physical  aspects  of  the  signal.  In  the  case  of  isolated  word  recognition 
machines,  such  an  emphasis  seems  justified.  However,  work  in  Artificial 
Intelligence  suggests  that  speech  recognition  system  performance  could 
be  improved  by  the  incorporation  of  syntactic  and  semantic  information. 

To  emphasize  the  Artificial  Intelligence  approach,  such  systems  are 
called  speech  understanding  systems. 

A number  of  syntactic  parsing  systems  for  natural  language  have 
been  produced  but  they  have  generally  involved  written  language  rather 
than  speech.  The  differences  between  speech  and  writing  are  important 
to  a system  in  at  least  three  areas. 

1. Function words and morphemes (such as is, of, the, -ing, -ed,
etc.)  are  often  indistinct  in  speech.  Since  parsers  for  written  language 
make  extensive  use  of  function  words  as  delimiters,  these  parsers  cannot 
be  directly  applied  to  spoken  language. 

2.  Prosodies  such  as  intonation,  stress,  pause,  juncture  and  rhythm 
are  important  signals  of  the  syntactic  structure  of  speech.  Such  signals 
permit  speech  to  have  a more  complex  syntactic  structure  than  has  written 
language. Prosodic information can be used by a parser for spoken English.


3. The relationship between the alphabetic units and their physical
representation is much less fixed in speech than in printing. Word
boundaries are very difficult to find. The acoustic realization of
phonological  elements  is  context  sensitive  and  unstable.  Often  some  of 
the  segmental  units  (phonemes)  which  are  in  the  dictionary  entry  for  a 
word  are  not  pronounced  at  all.  For  these  reasons,  parsers  for  spoken 
language  are  faced  with  much  greater  uncertainty  than  are  parsers  for 
writing. 

Thus a central problem in speech understanding is to develop a system
which takes account of the differences between speech and writing.
In particular, the shift of structural information from function words
and morphemes to prosodic features must be incorporated into the grammar.

In  order  to  incorporate  prosodic  information  into  an  automatic  speech 
recognition  system,  work  in  the  following  four  areas  was  proposed: 

1. Acoustic analysis of prosodies and phonological rules - we proposed
to collect limited protocols of spontaneous and read speech and to use
this data to develop algorithms to locate phonological word, phrase and
clause boundaries.

2.  Linguistic  analysis  of  prosodies  and  phonological  rules  - we 
proposed  to  survey  and  integrate  various  linguistic  studies  of  intonation 
and  rhythm,  cast  these  hypotheses  against  our  empirical  data  and  generate 
new  hypotheses.  We  also  proposed  to  collect  phonological  rules  with  the 
goal of developing a system for parsing or inverting such rules.

3. Integration of prosodic information into a SUR system - we proposed
to put a simple prosodic component into a SUR system, test it and
then extend it as a result of our acoustic research findings.

4. System Development - we proposed to develop a PDP-11
facility which could access various SUR systems over the ARPANET.

RESULTS

The ARPA speech project was originally organized into 5 major
projects conducted by organizations with experience in system design
and supported by 4 smaller projects with expertise in linguistics and
speech. By the end of the 5-year project, one of the systems had
actually achieved the original design goals (as re-interpreted by the
members of the project).

The  project  had  started  with  the  view  that  artificial  intelligence 
techniques had advanced to the point where they could offer the technological
basis for some practical application. Furthermore, developments
in computational linguistics, speech science and signal processing
suggested that speech recognition or "speech understanding" might be
such a practical application.

The  project  started  with  strength  in  artificial  intelligence,  speech 
and  linguistics.  The  first  two  years  consisted  of  a great  deal  of 
teaching  on  the  part  of  the  4 smaller  groups.  There  is  no  question  that 
the  result  was  a much  more  "linguistic",  more  principled,  less  ad  hoc 
design  for  all  of  the  major  systems. 

The next 3 years consisted of successive attempts by the major systems
to incorporate more and more of what was known about language. However,
in a sense none of them came close to incorporating even a fraction of
the well-known linguistic facts about English. Practical tasks of building
systems and of incorporating static rules into a dynamic procedure
overwhelmed their good intentions.

The final year, of course, was a frantic effort to cut the crap and
make something work. The irony of the project was that the only system
that did meet the goals was a simple combination of statistics and
low-level speech science: it had no artificial intelligence and no
linguistics, it was a pure engineering system, and it was written almost
as a side effort by a couple of students.

During  the  course  of  the  project,  I pushed  very  strongly  toward  an 
even  more  theoretically  correct  system.  In  the  beginning,  I naively 
thought  that  the  systems  could  incorporate  a larger  part  of  what  was 
known  about  language.  I pushed  for  studies  of  dialect,  communication 
modes,  natural  syntax,  etc.  — all  of  which  were  quite  irrelevant  to 
the  types  of  systems  that  were  finally  produced.  Had  all  of  us  been  more 
goal  oriented,  we  should  have  produced  much  more  limited,  more  successful, 
much less interesting systems. The project which would have resulted
would,  of  course,  have  had  a much  smaller  long  term  impact. 

Viewed from the perspective of achieving the group goals, my own
work was certainly counterproductive. It was designed to push the
systems in directions which, in retrospect, were the opposite of those
which "worked". I still do not know whether a serious, 10-year,
theoretically  correct  project  would  produce  a very  good  system  or  merely 
a slower  and  less  accurate  system.  However,  I do  know  from  the  limited 
success that we did inadvertently achieve that Weizenbaum's criticism of
the  speech  project  and  its  social  implications  was  absolutely  correct. 

The  government  does  not  need  more  word  spotters.  In  any  case,  the 
following  are  some  of  the  things  that  we  did  during  the  project: 


We added a toy prosodic component to the toy Hearsay I system at CMU.
While this showed that we could use the ARPA net better than most linguists
and were not bad at understanding other people's code and then modifying
it, it didn't have much effect on the system design. On the other hand,
I think that it resulted in Hearsay I having the only non-empty
"prosodic component" among all of the final systems.

We  looked  in  great  detail  at  the  various  BBN  grammars.  We  did 
add  some  pseudo-prosodies  to  the  LUNAR  system  which  at  least  showed 
that  we  knew  LISP  better  than  most  linguists.  We  also  discovered  why 
the  original  LUNAR  grammar,  while  a fantastic  contribution  to  text 
processing,  would  never  have  worked  as  a speech  recognition  component. 

The  BBN  project  discovered  the  same  fact  independently. 

We  also  looked  at  the  later  BBN  grammars  in  terms  of  what  kinds 
of  sentences  they  would  handle.  Our  results  were  not  especially  popular. 

However,  by  the  end  of  the  project,  BBN  had  caught  up  with  CMU  by 
abandoning  the  idea  of  grammar  altogether.  Neither,  however,  was  able 
to  surpass  the  higher  levels  of  the  SDS  system. 

We spent a lot of time trying to develop and use a speech processing
system on the ARPA net. From this, I learned that systems programming is
fun but can overwhelm the attempt to do anything "useful". However, Tovar
contributed a great deal to the general net community and to making Unix
a reasonable system for speech research. We also learned never to trust
anyone (ARPA) with a product to sell.

On a more scientific level, Alan Cole did a great deal of good work
using Hearsay II and the CMU speech data. This work on phonological rule
analysis is continuing to some extent at IBM where Alan now works. Had
the project continued, his work would have had a significant impact,
especially as it was in the spirit of the final statistical process that
actually worked.

Participating Scientific Personnel


The  following  people  participated  at  various  times  during  the 
project. 

Michael H. O'Malley
Alan Cole
Malcah Yaeger
Ron Bader
Greg Shenant
Dean Kloker
John Moch
Richard Gerould
David King

Cole  and  Kloker  will  receive  Ph.D.  degrees  for  their  work  on 
the project. King, Gerould and Bader have received Master's degrees for
their  work. 